Extracting Practical Observations from Impractical Datasets

    Published: 2025-01-16. Last Updated: 2025-01-16 02:43:03 UTC
    by Curtis Dibble, SANS.edu BACS Student (Version: 1)

    [This is a Guest Diary by Curtis Dibble, an ISC intern as part of the SANS.edu BACS [1] program]


    Figure 1: A heatmap showing the date and frequency with which a given set of commands was input to the honeypot

    Spoiler alert: sugar costs money, and syntactic sugar is the most expensive type. Fortunately, we live in an era where a developer’s laptop tends to come with resources unheard of in the server realm a decade ago [2] and a vibrant open-source community to make use of all that shiny new silicon. All that sweet, sweet, free and open source software living under the MIT license and the GNU Affero General Public License (commonly known as the AGPL) is a godsend to both the amateur and professional cyber practitioner - until it isn’t. Which brings us to the crux of the issue and why this article exists in the first place. Smarter and more experienced people than myself have been writing and discussing this for the last twenty-five years or more, but the simplest explanation is that cybersecurity as an industry suffers from a relationship with Not-Invented-Here Syndrome [3][4] dysfunctional enough to warrant its own entry in the DSM-5 (Diagnostic and Statistical Manual of Mental Disorders, 5th edition). A significant portion of that stems from the wildly different origin stories of practitioners. From administrative professionals to software engineers, network admins to high school dropouts, the value of their contributions lies in the varied life experience and broad spectrum of perspectives. The conflict, however, is the age-old problem of buy-vs-build.

    To lay out the problem with examples, I’ll start with my own experience. Folks without extensive programming history tend to reach for the ‘buy’ or ‘off the shelf’ button to solve a problem, while anyone that’s written enough scripts (myself included…) will have a text editor with some boilerplate ready faster than Shodan can advertise unprotected RDP (Remote Desktop Protocol) access. Using off-the-shelf open source is usually the right answer - if for no other reason than to see whether an idea has merit. At some point though, the internals need customization and optimization to handle the problem, or to open up options to integrate with other tools for deeper insights and fresh opportunities. To make that a little more concrete, I have nothing but praise for David Fitzmaurice and the work he’s done to weld together Zeek, Snort, and Grafana [5] with a little help from ChatGPT to dial it in. This is the right answer. Use best-in-breed tooling where you can, and tailor the important parts to the mission. Scott Jensen did exactly this with the DShield-SIEM project [6][7] and leaned hard on Elastic’s ELK (Elasticsearch, Logstash, Kibana) stack [8] as the implementation toolkit. Getting a working version of an idea off the ground is one of the hardest things around, but once it’s gained traction, the problem shifts to one of maintaining scope. As a concrete example, Joel Spolsky [3] made the point that Excel once had its own bespoke C compiler. While that limits scope, flexibility, and accessibility, it also means that at one point the internals of the bane of modern work were incredibly impressive - before it became just another fancy XML parser.

    Thoughts and Background

    Expanding on that, we have plenty of tools and one-offs, and we live with the near-axiomatic fact that there are no one-size-fits-all solutions in this arena. The ELK stack is mentioned often enough as the standard for security or general-purpose analytics work. The problem with this - and with the state of SIEM (Security Information and Event Management) or SOAR (Security Orchestration, Automation and Response) solutions in general - is that storage, compute, and memory have become cheap enough to leave data in the native format in which it was collected, then reparse it on update or cache results in temporary storage. I’ve dealt with this personally using the IBM QRadar [9] platform, waiting patiently for it to skim and then collate results from dozens of terabytes of flat files using regular expressions. That wouldn’t be an issue if it were at least fast for the end user, and if scaling didn’t imply a choice between massive capital expenditures for more licenses and hardware versus paying for additional cloud services. At the end of the day, this is one root cause of issues in the Information Age. In other words, professional tools exist and scale, but they are prohibitively expensive for all but the largest players, the only solution anyone is willing to try at scale involves throwing more servers at the problem, and nobody is interested in solving these combined issues. I built the parts included in this article and published the repository linked at the end to act as a foundation for practitioners to at least try something different.

    Moving on

    Admittedly, I have a few hesitations about using Docker-based products after dealing with them in production, which is one of the reasons for steering clear of the existing DShield-SIEM solution. I also wanted to get more out of the internship than simply observing and reporting. So, using the tools that I knew, I set off to see if it was worth reinventing the wheel, or at least to figure out if there was a better way forward. In the process, I had to apply several years of experiments across different fields to an old problem, and something novel seems to have come out the other end. Fair warning: I am by no stretch a master of SQL. At best, I am a semi-literate student of the language with access to a deep library. Getting to this point required extensive study of Joe Celko’s work [10][11], a lot of experimentation, too much time reading SQL Server documentation [12], and more Google-fu than I’m willing to admit. Standing up a prototype also meant having significant personal infrastructure to fail relentlessly forward. Fortunately, the slimmed down version is laptop friendly.
     


    Figure 2: 60% of the time it happens every time. SQL Server and proper database management systems are memory hogs not meant for consumer hardware. Unless you set the proper constraints and configurations, of course.

    This project would never have forced me to face certain scaling issues without a little help from Jesse La Grew and the massive dump of logs he provided. After that contribution, I ended up with an additional 11GB of Cowrie, Firewall, and Web/Honeypot logs to work with. There are also more than 300GB of compressed PCAP capture files included, but I have not had the time to create a decent model for them. Onto the meat and potatoes.

    The first problem I solved wound up being the key to the entire structure – data deduplication and type-casting. In the end, an IP address as we recognize it is nothing more than a formatted decimal representation of a binary string with a predefined length. This applies to IPv4 just as much as it does to IPv6 - or even to an MD5 hash defining a session identifier. As it turns out, both open source and commercial implementations of SQL are highly optimized to process binary data, and come with extensive libraries of purpose-built functions to do precisely that. Using in-situ conversion between VARCHAR(45) and VARBINARY(16) storage, I was able to build a lookup table that can be cached, searched, and indexed down to a small integer identifier, with the mechanics hidden behind a series of stored procedures and views. It stores and compares both IPv4 and IPv6 in the same column and uses a single interface to interpolate both, while keeping this activity completely invisible to the end user.
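
    To make the idea tangible, here is a quick round-trip sketch using the ToBinary and FromBinary helper functions shown later in this article. The literal addresses are documentation examples, not honeypot data.

    DECLARE @V4 VARBINARY(16) = [IPAddress].[ToBinary]('192.0.2.1');
    DECLARE @V6 VARBINARY(16) = [IPAddress].[ToBinary]('2001:0db8:0000:0000:0000:0000:0000:0001');
    SELECT
        DATALENGTH(@V4)               AS [V4Bytes],   -- 4  (stored as 0xC0000201)
        DATALENGTH(@V6)               AS [V6Bytes],   -- 16
        [IPAddress].[FromBinary](@V4) AS [V4String],  -- '192.0.2.1'
        [IPAddress].[FromBinary](@V6) AS [V6String];  -- '2001:0db8:0000:0000:0000:0000:0000:0001'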

    Type (Length)  | Storage (Bytes)    | Max Size (Bytes)    | Note
    VARCHAR(45)    | D + 2              | 47                  | Non-Unicode, 1 byte/character
    VARCHAR(MAX)   | D + P (if off-row) | 2 x 1024**3 (2 GB)  | LOB for values over 8KB
    NVARCHAR(45)   | 2 x C + 2          | 92                  | Unicode, 2 bytes/character
    NVARCHAR(MAX)  | D + P              | 2 x 1024**3 (2 GB)  | LOB for values over 8KB
    BINARY(16)     | D = 16             | 16                  | No metadata overhead, fixed size
    VARBINARY(16)  | D + 2              | 18                  | Binary, stores raw byte sequences
    VARBINARY(MAX) | D + P (if off-row) | 2 x 1024**3 (2 GB)  | LOB for binary values over 8KB
    Figure 3: Notes - LOB = Large Object, Storage: D = Data Length, P = Pointer, C = Character Count

     

    The table above is a quick breakdown of datatypes, storage sizes, and why it’s important to be specific. This is SQL Server specific, but should translate fairly well across different vendors.
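
    If the numbers in Figure 3 feel abstract, DATALENGTH makes the difference easy to see for yourself; a minimal check along these lines shows the same value costing 39, 78, or 16 bytes depending on the type chosen:

    SELECT
        DATALENGTH(CONVERT(VARCHAR(45),  '2001:0db8:0000:0000:0000:0000:0000:0001')) AS [AsVarchar],   -- 39 bytes
        DATALENGTH(CONVERT(NVARCHAR(45), '2001:0db8:0000:0000:0000:0000:0000:0001')) AS [AsNvarchar],  -- 78 bytes
        DATALENGTH(0x20010DB8000000000000000000000001)                               AS [AsVarbinary]; -- 16 bytes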



    Figure 4: SSMS Query Execution: A wise man once told me: show, don’t tell. Here’s the query plan for an unoptimized view that will scan about 33.5 million rows, perform on-the-fly type conversions and comparisons of binary data to strings, and return approximately 1.4 million results.

     


    Figure 5: SSMS Query: The actual query and sample results showing various JOINS and transformations.

     


    Figure 6: Client Execution Statistics for SQL Server from this Query. It takes about 7 seconds when not displaying the plan, and this instance is restricted to 30% of laptop resources.

     

    I continued to apply this general idea to other portions of the Cowrie, Firewall, and Honeypot logs, and created a schema called Rolodex to contain the most common lookups. For a sense of scale, there are nearly 40 million log entries stored this way, and the total database consumes about 3600MB of disk space after vectorization using a hierarchy structure detailed by Joe Celko in his book on Trees and Hierarchies [11].


    Figure 7: Cowrie Logs Storage [Cowrie].[Logs] 

    Figure 8: Firewall Logs Storage [Logs].[Firewall]

    Figure 9: Honeypot Logs Storage [Logs].[Honeypot]

    Figure 10: SSH Logs Storage [Logs].[RawSSH]

     

    Log      | Table          | Row Count  | Data (MB) | Index (MB) | Compression?
    Cowrie   | Cowrie.Logs    | 4,485,242  | 177.891   | 0.672      | None
    Firewall | Logs.Firewall  | 33,513,712 | 1462.719  | 1522.680   | None
    Honeypot | Logs.Honeypot  | 31,213     | 1.242     | 1.703      | None
    SSH      | Logs.RawSSH    | 11,860     | 0.430     | 0.781      | None

    Figure 11: Log table metadata. 
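
    For anyone who wants to reproduce the numbers in Figure 11 on their own instance, SQL Server’s built-in sp_spaceused procedure reports per-table row counts and data/index sizes, one call per table:

    EXEC sp_spaceused N'Cowrie.Logs';
    EXEC sp_spaceused N'Logs.Firewall';
    EXEC sp_spaceused N'Logs.Honeypot';
    EXEC sp_spaceused N'Logs.RawSSH';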

     

    The Code

    That’s enough talk. Knowing that code reuse is the holy grail of software engineering, I refactored a few portions into separate modules so that readers can walk away from this with something useful without taking a deep dive into the finer points of data science. There is a link to the GitHub repository in the references section. All code shown in this article is licensed under the GNU Affero General Public License, more commonly known as the AGPL. While this is plain SQL, it is targeted at Microsoft SQL Server 2019, but should be fairly easy to modify for use with an open source database such as PostgreSQL, libSQL/SQLite3, or other commonly available options. To avoid making any assumptions about readers’ backgrounds, I’ll break down why this is important, which might also shed some light on why every tech company worth its salt has spent billions of dollars optimizing JavaScript over the last two decades. It should also explain part of how we wound up with TypeScript, WebAssembly, React, and other technologies. That said, time to take a look under the hood!

    /*
     *  IP Address SQL Schema
     *  A schema to perform fast CRUD and analysis operations in plain SQL
     *  Copyright (C) 2025 Curtis Dibble
     *
     *  This program is free software: you can redistribute it and/or modify
     *  it under the terms of the GNU Affero General Public License as
     *  published by the Free Software Foundation, either version 3 of the
     *  License, or (at your option) any later version.
     *  
     *  This program is distributed in the hope that it will be useful,
     *  but WITHOUT ANY WARRANTY; without even the implied warranty of
     *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
     *  GNU Affero General Public License for more details.
     *  
     *  You should have received a copy of the GNU Affero General Public License
     *  along with this program.  If not, see <https://www.gnu.org/licenses/>.
     */
    -- =============================================
    -- Title:       	IPAddress - IP To String
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		Scalar Function
    -- Description:
    --	Convert a (Single) VARBINARY(16) IPv4 or 
    --	IPv6 Address to a VARCHAR(45)
    --	Representation for display or transport
    -- =============================================
    
    CREATE FUNCTION [IPAddress].[FromBinary] (@IPBinary VARBINARY(16))
    RETURNS VARCHAR(45) AS
    BEGIN
        DECLARE @Result VARCHAR(45) = '0.0.0.0';
        IF @IPBinary IS NULL RETURN NULL;
        -- 4 bytes: format as dotted-decimal IPv4
        ELSE IF DATALENGTH(@IPBinary) = 4
            RETURN
                CONVERT(VARCHAR(3), CONVERT(TINYINT, SUBSTRING(@IPBinary, 1, 1))) + '.' +
                CONVERT(VARCHAR(3), CONVERT(TINYINT, SUBSTRING(@IPBinary, 2, 1))) + '.' +
                CONVERT(VARCHAR(3), CONVERT(TINYINT, SUBSTRING(@IPBinary, 3, 1))) + '.' +
                CONVERT(VARCHAR(3), CONVERT(TINYINT, SUBSTRING(@IPBinary, 4, 1)));
        -- 16 bytes: format as eight colon-separated hex groups (IPv6)
        ELSE IF DATALENGTH(@IPBinary) = 16
            RETURN
                CONVERT(VARCHAR(4), SUBSTRING(@IPBinary,  1, 2), 2) + ':' +
                CONVERT(VARCHAR(4), SUBSTRING(@IPBinary,  3, 2), 2) + ':' +
                CONVERT(VARCHAR(4), SUBSTRING(@IPBinary,  5, 2), 2) + ':' +
                CONVERT(VARCHAR(4), SUBSTRING(@IPBinary,  7, 2), 2) + ':' +
                CONVERT(VARCHAR(4), SUBSTRING(@IPBinary,  9, 2), 2) + ':' +
                CONVERT(VARCHAR(4), SUBSTRING(@IPBinary, 11, 2), 2) + ':' +
                CONVERT(VARCHAR(4), SUBSTRING(@IPBinary, 13, 2), 2) + ':' +
                CONVERT(VARCHAR(4), SUBSTRING(@IPBinary, 15, 2), 2);
        -- Fallback for unexpected lengths
        RETURN @Result;
    END;
    GO
    -- =============================================
    -- Title:		IPAddress - String To IP
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		Scalar Function
    -- Description:
    --	Convert a (Single) VARCHAR(45) IPv4 or 
    --	IPv6 Address to a VARBINARY(16) 
    --	Representation for storage or calculations
    -- =============================================
    
    CREATE FUNCTION [IPAddress].[ToBinary] (@IPString VARCHAR(45))
    RETURNS VARBINARY(16) AS
    BEGIN
        IF @IPString IS NULL RETURN NULL;
        DECLARE
            @Result              VARBINARY(16),
            @Token               VARCHAR(4),
            @VBytes              VARBINARY(16) = 0x,
            @VBZone              VARBINARY(2),
            @TempIPAddress       VARCHAR(46),      -- input plus one trailing delimiter
            @ColumnIndex         TINYINT,
            @Delimiter           CHAR(1) = '.',
            @PreviousColumnIndex TINYINT = 0,
            @Parts               TINYINT = 0,
            @Limit               TINYINT = 4;
        -- Get position of IPv4 delimiter
        SET @ColumnIndex = CHARINDEX(@Delimiter, @IPString);
        -- If not IPv4, then assume IPv6
        IF @ColumnIndex = 0
        BEGIN
            SELECT
                @Delimiter = ':',
                @Limit = 8,
                @ColumnIndex = CHARINDEX(@Delimiter, @IPString);
            -- Get number of parts (delimiters)
            WHILE @ColumnIndex > 0
                SELECT
                    @Parts += 1,
                    @ColumnIndex = CHARINDEX(@Delimiter, @IPString, @ColumnIndex + 1);
            SET @ColumnIndex = CHARINDEX(@Delimiter, @IPString);
            IF @ColumnIndex = 0 RETURN NULL;
        END
        -- Add trailing delimiter (needs a variable one character larger than the input)
        SET @TempIPAddress = @IPString + @Delimiter;
        WHILE @ColumnIndex > 0
        BEGIN
            SET @Token = SUBSTRING(
                @TempIPAddress,
                @PreviousColumnIndex + 1,
                @ColumnIndex - @PreviousColumnIndex - 1);
            IF @Delimiter = ':'
            BEGIN
                SELECT
                    @VBZone = CONVERT(VARBINARY(2), RIGHT('0000' + @Token, 4), 2),
                    @VBytes += @VBZone;
                -- Handles the consecutive-sections-of-zeros representation rule (i.e. ::)
                -- (https://en.wikipedia.org/wiki/IPv6#Address_representation)
                IF @Token = ''
                    WHILE @Parts + 1 < @Limit
                        SELECT @VBytes += @VBZone, @Parts += 1;
            END
            ELSE
            BEGIN
                SELECT
                    @VBZone = CONVERT(VARBINARY(1), CONVERT(TINYINT, @Token)),
                    @VBytes += @VBZone;
            END
            SELECT
                @PreviousColumnIndex = @ColumnIndex,
                @ColumnIndex = CHARINDEX(@Delimiter, @TempIPAddress, @ColumnIndex + 1);
        END
        SET @Result = CASE @Delimiter
            WHEN ':' THEN @VBytes
            ELSE /*0x000000000000000000000000 +*/ @VBytes
            -- Return IPv4 addresses as 4 byte binary
            -- (uncomment the leading-zeros section to force 16 byte binary)
        END
        RETURN @Result;
    END;
    GO
    -- =============================================
    -- Title:       	IPAddress - Addresses
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		Table
    -- Description:
    --	Table to store unique IP addresses as 
    --	Binary (All) (VARBINARY(16))
    --	Indexed on Value to prevent duplicates
    -- =============================================
    CREATE TABLE [IPAddress].[Addresses] (
        [Id]            INT IDENTITY(1,1) PRIMARY KEY,
        [Value]         VARBINARY(16) NOT NULL
    );
    GO
    -- Unique index on IP Address to prevent duplicates
    CREATE UNIQUE INDEX IX_IPAddress_Addresses_Value ON [IPAddress].[Addresses]([Value]);
    GO
    

     

    Looking at the data that DShield honeypots produce, there seemed to be a lot of opportunity to eliminate duplicate data – or at least reduce it without destroying records in the process. High five to Bryan Simon’s SEC401 course material (GSEC) for bridging the gap between what traverses the network and what lives at the application/presentation layers, and another one to David Hoelzer for his work on SEC503 (GCIA).

    PostgreSQL and others come with their own built-in datatypes/handling for network addresses, but often enough, being vendor-agnostic is more important than native performance. You can usually throw more silicon at a problem, but asking for additional developers or more licenses has a funny tendency of ruffling the wrong people’s feathers. Starting with an older version of SQL Server forced me to look at the problem through that lens. The ToBinary and FromBinary functions, and the first iteration of the Addresses table were the initial successes after trying to get a toehold on how to parse these logs more efficiently without resorting to handling tens of millions of lines worth of text using command line tools. JQ and others work fine for experimentation, purpose-built document databases exist, and plenty of services have been built to solve this problem using every backend under the sun – but nothing quite compares to being able to walk away from a bad licensing deal or subscription without a care in the world. Again, I’m not a master of SQL, but I’ve run into enough problems over time that it looked like optimizing IP address storage could be a pattern worth exploiting in the name of speed. Knowing that IPv6 is slowly gaining traction, I wanted to make sure that any solution wouldn’t find itself abandoned and unloved in a few years due to gradual shifts in standards. We try not to redo perfectly good work in my house. 

    I’ve built other projects that handled large datasets with timeseries components. Trying that many things meant putting every Python ORM (Object-Relational Mapper), web framework, and other toolkit through their paces, in an effort to find a magical abstraction to manage the complex SQL structures that always came up. Once, just to see what would happen, I left a webscraper running for several weeks. The program and backend would never be called ‘perfect,’ but they were definitely better than ‘best effort.’ During that period, it turned into an unwieldy (and nearly unusable!) 72GB SQLite3 database with about 40 tables. I still keep a copy as a reminder.

    One of the key choices to notice is the use of INT vs BIGINT as the datatype for the Id column. A more observant reader might notice that I used an integer identity column and Id in the first place, instead of checks, uniqueness constraints, or other - arguably better - practices. Those are entirely fair observations, and the use of an integer identity started as a consequence of ignorance. This consequence grows more severe when one realizes that the entire IPv4 address space is 2**32 unique numbers, the INT type can represent about 2**32 unique values, and the IPv6 address space is roughly 2**96 (about 10**28) times larger than either. In my defense, if any group of honeypots feeding data into this system encounters more than 4 billion unique clients, we all have far more serious problems to worry about.

    Fear not! There is, in fact, a method to the madness. Humor me.
     

    -- =============================================
    -- Title:       	IPAddress - Insert IP
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		Stored Procedure
    -- Description:
    --	Insert a (Single) VARCHAR(45) IPv4 or 
    --	IPv6 Address to the IP Address table and
    --	store it as a VARBINARY(16) item
    -- =============================================
    CREATE PROCEDURE [IPAddress].[InsertAddress] @IPString VARCHAR(45) AS
    BEGIN 
    INSERT INTO [IPAddress].[Addresses] ([Value]) 
    VALUES ([IPAddress].[ToBinary](@IPString));
    SELECT SCOPE_IDENTITY() AS NewID;
    END;
    GO
    -- =============================================
    -- Title:       	IPAddress - IPv4 As Binary
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		View
    -- Description:
    --	View to retrieve IPv4 addresses from the 
    --	Address table as VARBINARY(16)
    -- =============================================
    
    CREATE VIEW [IPAddress].[IPv4AsBinary]
        AS SELECT
        [Id],
        [Value] AS [Address]
        FROM [IPAddress].[Addresses]
        WHERE DATALENGTH([Value]) = 4;
    GO
    -- =============================================
    -- Title:       	IPAddress - IPv4 As String
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		View
    -- Description:
    --	View to retrieve IPv4 addresses from the 
    --	Address table as VARCHAR(45)
    -- =============================================
    
    CREATE VIEW [IPAddress].[IPv4AsString]
        AS SELECT
        [Id],
        [IPAddress].[FromBinary]([Value]) AS [Address]
        FROM [IPAddress].[Addresses]
        WHERE DATALENGTH([Value]) = 4;
    GO
    -- =============================================
    -- Title:       	IPAddress - IPv6 As Binary
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		View
    -- Description:
    --	View to retrieve IPv6 addresses from the 
    --	Address table as VARBINARY(16)
    -- =============================================
    
    CREATE VIEW [IPAddress].[IPv6AsBinary]
        AS SELECT
        [Id],
        [Value] AS [Address]
        FROM [IPAddress].[Addresses]
        WHERE DATALENGTH([Value]) = 16;
    GO
    -- =============================================
    -- Title:       	IPAddress - IPv6 As String
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		View
    -- Description:
    --	View to retrieve IPv6 addresses from the 
    --	Address table as VARCHAR(45)
    -- =============================================
    
    CREATE VIEW [IPAddress].[IPv6AsString]
        AS SELECT
        [Id],
        [IPAddress].[FromBinary]([Value]) AS [Address]
        FROM [IPAddress].[Addresses]
        WHERE DATALENGTH([Value]) = 16;
    GO
    

     

    Using a set of VIEW and PROCEDURE objects to invisibly convert between VARBINARY(16) and VARCHAR(45), Create/Read/Update/Delete (CRUD) becomes dramatically easier, less error-prone, and open to automation with whatever procedural language you’re most comfortable with. I did this with Python, using the Pandas package and the pymssql database driver to manage connections. Golang, C#/.NET, Bash, PowerShell, JavaScript, or plain Java would work just as well, if not better. That’s part of the explanation, but it’s the part that misses the point. The point is that the volume of data available to analyze is exploding at an exponential rate. There’s no universe where anyone has the time to sit and comb through tens of millions - or even billions - of records by hand, let alone search for a needle in a sewing factory.
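
    As a minimal sketch of what that looks like from the database side (the same statements are what a Python, PowerShell, or Go client would send through its driver), here is the interface exercised in plain T-SQL with an illustrative documentation address:

    -- Insert a new address through the stored procedure; it returns the new Id
    EXEC [IPAddress].[InsertAddress] @IPString = '203.0.113.7';
    -- Read it back as a human-readable string through the view
    SELECT [Id], [Address]
    FROM [IPAddress].[IPv4AsString]
    WHERE [Address] = '203.0.113.7';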

     

    -- =============================================
    -- Title:       IPAddress - Subnets
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		Table
    -- Description:
    --	Table to store Subnet addresses and associated
    --	Netmask values (in CIDR notation)
    -- =============================================
    CREATE TABLE [IPAddress].[Subnets] (
        [Id]            INT IDENTITY(1,1) NOT NULL PRIMARY KEY NONCLUSTERED,
        [Value]         VARBINARY(16) NOT NULL,
        [CIDR]          TINYINT NOT NULL CHECK ([CIDR] <= 96),
        [Netmask]       VARBINARY(16) NOT NULL,
        [Bogon]         BIT NOT NULL DEFAULT 0,
        [Comment]       VARCHAR(128) NULL,
        -- Clustered on the subnet itself; the primary key is nonclustered so the two don't collide
        CONSTRAINT UQ_IPAddress_Subnets_Address_CIDR_Netmask
            UNIQUE CLUSTERED ([Value], [CIDR], [Netmask])
    );
    -- =============================================
    -- Title:       	IPAddress - Bogons
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		Table
    -- Description:
    --	Table to store Bogon IP addresses,
    --	aka: loopback or non-public ranges
    -- =============================================
    CREATE TABLE [IPAddress].[Bogons] (
        [Id]            INT IDENTITY(1,1) PRIMARY KEY,
        [Value]         VARBINARY(16) NOT NULL
    );
    -- Unique index on Bogon IP Address to prevent duplicates
    CREATE UNIQUE INDEX IX_IPAddress_Bogons ON [IPAddress].[Bogons]([Value]);
    
    -- =============================================
    -- Title:       	IPAddress - CloudProviders
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		Table
    -- Description:
    --	Table to store unique Cloud Providers for
    --	fast lookups
    -- =============================================
    CREATE TABLE [IPAddress].[CloudProviders](
        [Id]            INT IDENTITY(1,1) NOT NULL,
        [Value]         NVARCHAR(255) NOT NULL,
        PRIMARY KEY CLUSTERED ([Id] ASC),
        CONSTRAINT [UQ_IPAddress_CloudProviders_Name]
            UNIQUE NONCLUSTERED ([Value] ASC)
    );
    -- =============================================
    -- Title:		IPAddress - Cloud Provider Subnets
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		Table
    -- Description:
    --	Store Reference Table of IP Ranges owned
    --	by various major cloud providers
    -- =============================================
    CREATE TABLE [IPAddress].[CloudProviderSubnets] (
        [Id]                INT IDENTITY(1,1) PRIMARY KEY,
        [SubnetID]          INT NOT NULL,
        [NetMask]           INT NOT NULL,
        [CloudProviderID]   INT NOT NULL,
        [Comment]           NVARCHAR(255) NULL DEFAULT NULL,
        CONSTRAINT UQ_IPAddress_CloudProviders_Ip
            UNIQUE ([SubnetID], [CloudProviderID]),
            -- Only one provider can own a range at a time...
        CONSTRAINT CK_IPAddress_CloudProviders_NetMask
            CHECK ([NetMask] >= 0 AND [NetMask] <= 128),
            -- Not perfect, but better than nothing
        CONSTRAINT FK_IPAddress_CloudProviderSubnets_Subnet
            FOREIGN KEY ([SubnetID])
            REFERENCES [IPAddress].[Subnets]([Id]),
        CONSTRAINT FK_IPAddress_CloudProviderSubnets_CloudProvider
            FOREIGN KEY ([CloudProviderID])
            REFERENCES [IPAddress].[CloudProviders]([Id])
    );
    -- =============================================
    -- Title:       	Reference - RFC
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		Reference
    -- Type:		Table
    -- Description:
    --  Store Quick Lookup of RFC Info (All)
    -- =============================================
    CREATE TABLE [Reference].[RFC] (
        [Number]		INT PRIMARY KEY,
        [Name]		VARCHAR(127) NOT NULL DEFAULT '',
        [Description]	VARCHAR(255) NOT NULL DEFAULT '',
        CONSTRAINT UQ_Reference_RFC_Number_Name UNIQUE ([Number], [Name])
    );
    -- =============================================
    -- Title:       	Preload - RFC
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		Reference
    -- Type:		Data
    -- Description:
    --  Preload the RFC table
    -- =============================================
    INSERT INTO [Reference].[RFC]
        ([Number], [Name], [Description])
    VALUES
        (1045, 'VERSATILE MESSAGE TRANSACTION PROTOCOL', 'https://datatracker.ietf.org/doc/html/rfc1045'),
        (1112, 'Host Extensions for IP Multicasting', 'https://datatracker.ietf.org/doc/html/rfc1112'),
        (1142, 'OSI IS-IS Intra-domain Routing Protocol', 'https://datatracker.ietf.org/doc/html/rfc1142'),
        ...
        (998, '', 'https://datatracker.ietf.org/doc/html/rfc998');
    
    -- =============================================
    -- Title:       	Logs - Honeypot
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		Logs
    -- Type:		Table
    -- Description:
    --  Table containing all Honeypot Logs
    -- =============================================
    CREATE TABLE [Logs].[Honeypot](
    	[LogID]		INT IDENTITY(1,1) NOT NULL,
    	[Count]		INT NOT NULL,
    	[LogTimestamp]	DATETIME2(7) NOT NULL,
    	[SourceID]		INT NOT NULL,
    	[UrlID]		INT NOT NULL,
    	[UserAgentID]	INT NOT NULL,
        CONSTRAINT FK_Logs_Honeypot_Source 
            FOREIGN KEY ([SourceID]) 
            REFERENCES [IPAddress].[Addresses]([Id]),
        CONSTRAINT FK_Logs_Honeypot_UserAgent
            FOREIGN KEY ([UserAgentID]) 
            REFERENCES [Rolodex].[UserAgents]([Id]),
        CONSTRAINT FK_Logs_Honeypot_Url
            FOREIGN KEY ([UrlID]) 
            REFERENCES [Rolodex].[Urls]([Id])
    );
    CREATE INDEX IX_Logs_Honeypot_LogTimestamp 
        ON [Logs].[Honeypot]([LogTimestamp]);
    CREATE INDEX IX_Logs_Honeypot_SourceID 
        ON [Logs].[Honeypot]([SourceID]);
    -- =============================================
    -- Title:       	Logs - Firewall
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		Logs
    -- Type:		Table
    -- Description:
    --  Table containing all Firewall Logs
    -- =============================================
    
    CREATE TABLE [Logs].[Firewall](
    	[LogID]		INT IDENTITY(1,1) NOT NULL,
    	[LogTimestamp]	DATETIME2(7) NOT NULL,
    	[SourceID]		INT NOT NULL,
    	[SourcePort]		INT NOT NULL,
    	[TargetID]		INT NOT NULL,
    	[TargetPort]		INT NOT NULL,
    	[Protocol]		INT NOT NULL,
    CONSTRAINT FK_Logs_Firewall_Source FOREIGN KEY ([SourceID]) 
        REFERENCES [IPAddress].[Addresses]([Id]),
    CONSTRAINT FK_Logs_Firewall_Target FOREIGN KEY ([TargetID]) 
        REFERENCES [IPAddress].[Addresses]([Id]),
    CONSTRAINT FK_Logs_Firewall_Protocol FOREIGN KEY ([Protocol]) 
        REFERENCES [Reference].[Protocols]([Number]) 
    ); 
    -- =============================================
    -- Title:       	Logs - RawSSH
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		Logs
    -- Type:		Table
    -- Description:
    --  Table containing all SSH Logs
    -- =============================================
    
    CREATE TABLE [Logs].[RawSSH](
    	[LogID]		INT IDENTITY(1,1) NOT NULL,
    	[LogTimestamp]	DATETIME2(7) NOT NULL,
    	[SourceID]		INT NOT NULL,
    	[SourcePort]		INT NOT NULL,
    	[UsernameID]		INT NOT NULL,
    	[PasswordID]		INT NOT NULL,
        CONSTRAINT FK_Logs_RawSSH_Source 
            FOREIGN KEY ([SourceID]) 
            REFERENCES [IPAddress].[Addresses]([Id]),
        CONSTRAINT FK_Logs_RawSSH_Username
            FOREIGN KEY ([UsernameID]) 
            REFERENCES [Rolodex].[Usernames]([Id]),
        CONSTRAINT FK_Logs_RawSSH_Password
            FOREIGN KEY ([PasswordID]) 
            REFERENCES [Rolodex].[Passwords]([Id])
    );
    CREATE INDEX IX_Logs_RawSSH_LogTimestamp 
        ON [Logs].[RawSSH]([LogTimestamp]);
    CREATE INDEX IX_Logs_RawSSH_SourceID 
        ON [Logs].[RawSSH]([SourceID]);
    
    -- =============================================
    -- Title:      	Rolodex - Fingerprint
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		Rolodex
    -- Type:		Table
    -- Description:
    --  Table to store unique fingerprints as binary (All)
    -- =============================================
    CREATE TABLE [Rolodex].[Fingerprints] (
    	[Id]	INT IDENTITY(1,1) PRIMARY KEY NOT NULL,
    	[Value]	VARBINARY(16) NOT NULL UNIQUE
    );
    -- Supporting index on Fingerprint value (duplicates already prevented by the UNIQUE constraint above)
    CREATE INDEX IX_Rolodex_Fingerprints_Value ON [Rolodex].[Fingerprints]([Value]);
    GO
    -- =============================================
    -- Title:       	Rolodex - Fingerprint To Binary 
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		Rolodex
    -- Type:		Scalar Function
    -- Description:
    --	Convert a fingerprint of form 
    --	'01:23:45:67:89:ab:cd:ef:01:23:45:67:89:ab:cd:ef' 
    --	to binary
    -- =============================================
    CREATE FUNCTION [Rolodex].[FingerprintToBinary]
    (@Fingerprint VARCHAR(255))
    RETURNS VARBINARY(16)
    AS
    BEGIN
        DECLARE @Clean VARCHAR(32);
        -- Remove colons and ensure lowercase
        SET @Clean = LOWER(REPLACE(@Fingerprint, ':', ''));
        -- Convert hex string to binary
        RETURN CAST('' AS XML).value('xs:hexBinary(sql:variable("@Clean"))', 'VARBINARY(16)');
    END;
    GO
    -- =============================================
    -- Title:		Rolodex - Fingerprint From Binary 
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		Rolodex
    -- Type:		Scalar Function
    -- Description:
    --	Convert a binary representation of a 
    --  fingerprint back to form 
    --  '01:23:45:67:89:ab:cd:ef:01:23:45:67:89:ab:cd:ef' 
    --  from binary
    -- =============================================
    CREATE FUNCTION [Rolodex].[FingerprintFromBinary]
    (@Binary VARBINARY(16))
    RETURNS VARCHAR(47)
    AS
    BEGIN
        DECLARE @HexString VARCHAR(32);
        -- Convert binary to hex string
        SET @HexString = LOWER(
            CAST('' AS XML).value(
    'xs:hexBinary(sql:variable("@Binary"))', 
    'VARCHAR(32)')
    );    
        -- Insert colons every 2 characters
        RETURN STUFF(
            (
                SELECT ':' + SUBSTRING(@HexString, Number, 2)
                FROM (
                    SELECT ROW_NUMBER() OVER (ORDER BY object_id)
                    FROM sys.objects
                ) AS Numbers(Number)
                WHERE Number % 2 = 1 
                AND Number <= LEN(@HexString)
                FOR XML PATH('')
            ), 1, 1, ''
        );
    END;
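
    A quick round trip through the two fingerprint helpers above, with a throwaway example value, shows the intended behavior:

    DECLARE @Fp VARBINARY(16) =
        [Rolodex].[FingerprintToBinary]('01:23:45:67:89:ab:cd:ef:01:23:45:67:89:ab:cd:ef');
    SELECT
        DATALENGTH(@Fp)                        AS [StoredBytes],  -- 16 bytes on disk
        [Rolodex].[FingerprintFromBinary](@Fp) AS [Fingerprint];  -- original colon-separated string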
    

     

    So, what makes INT IDENTITY(1,1) PRIMARY KEY special, if it’s such a bad practice? What’s the use of having a lookup table for RFC numbers, or an entire schema called Reference that exists solely to provide constants or points of reference that attach data to real-world standards or observed reality? Using IDENTITY(1,1) as the clustered primary key ties each row to a specified (and indexed!) location on disk, which buys faster read performance. The point of having data in the first place is to make use of it. The point of having a highly regimented data warehouse adhering to strict standards is to ensure a level of confidence in the results of analysis. The point of having fast read performance and memory-optimized data structures sitting in that data warehouse is to put a GPU to work and use the machine learning tools refined over the last 20 years to solve real-world problems. Automate the boring stuff, right?
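
    As an example of what that buys in practice, a query along these lines turns the raw protocol numbers in the firewall log into readable labels at read time. The join assumes the Protocols reference table carries a Name column alongside Number, mirroring the RFC table shown above; the exact column names live in the repository.

    -- Hypothetical [Name] column assumed on Reference.Protocols; adjust to the repo's actual schema
    SELECT p.[Name] AS [Protocol], COUNT(*) AS [Hits]
    FROM [Logs].[Firewall] AS f
    JOIN [Reference].[Protocols] AS p ON p.[Number] = f.[Protocol]
    GROUP BY p.[Name]
    ORDER BY [Hits] DESC;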

    Those are some bold statements. Looking at the hardware available, using INT instead of BIGINT enables the most efficient training methods for identifying patterns in honeypot logs, while supporting the fastest reads possible. More importantly, it lets amateur sleuths skip building servers in favor of simply performing analysis or deep learning tasks on the same commonly available commodity hardware they used to read the paper this morning. A 32-bit address space of about 4 billion possibilities sounds like a lot, and would be incompatible with the comparatively limited range of a 16-bit floating point type, except that many of the reference tables contain significantly fewer than 65,535 entries. The tables which do contain more can be windowed and remapped to that size.
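
    One way to picture that “window and remap” step is a sketch like the following, under the assumption that a one-week slice of firewall traffic touches far fewer than 65,535 distinct sources: only the ids that actually appear in the window receive a dense numbering small enough for 16-bit types.

    -- Dense, compact ids for the sources seen in a one-week window (illustrative dates)
    SELECT
        [SourceID],
        DENSE_RANK() OVER (ORDER BY [SourceID]) AS [CompactId]  -- fits in 16 bits if the window is small enough
    FROM [Logs].[Firewall]
    WHERE [LogTimestamp] >= '2024-10-31' AND [LogTimestamp] < '2024-11-08';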

    Even an IPv6 address can be mapped to a series of TINYINT columns in a Pandas dataframe before being fed into Dask. 
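
    In SQL terms, that mapping is just a byte-wise split of the stored VARBINARY value; a sketch for the IPv4 side, which extends to 16 bytes for IPv6:

    -- Each stored byte becomes its own TINYINT column, ready for small numeric dtypes client-side
    SELECT
        [Id],
        CONVERT(TINYINT, SUBSTRING([Address], 1, 1)) AS [Octet1],
        CONVERT(TINYINT, SUBSTRING([Address], 2, 1)) AS [Octet2],
        CONVERT(TINYINT, SUBSTRING([Address], 3, 1)) AS [Octet3],
        CONVERT(TINYINT, SUBSTRING([Address], 4, 1)) AS [Octet4]
    FROM [IPAddress].[IPv4AsBinary];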

    This might sound far-fetched, but that’s the direction the world is headed: AI/ML models being trained or performing live tasks at the edge or, more commonly, on local devices, as opposed to training and executing on purpose-built servers in cloud datacenters. Intel has invested heavily in this with the OpenVINO toolkit [13], which currently supports INT8 and partially supports INT4. The target market they’re shaping appears to be FPGA-centric, since Intel retains a competitive advantage in that segment despite Xilinx’s decades of success and innovation in the space. The competitors (AMD, Nvidia) are more focused on white-hot GPU sales as it is, and FPGAs offer an impressive combination of speed, computational capability, and minimal power requirements despite the current price issues with that class of silicon.

    In The Meantime…

    Now, it wouldn’t make any sense to go to all this effort without a good visualization or monitoring option, which is why I used Grafana [14]. Half of that is because it’s a platform I’ve worked with for years; the other half is that it had prebuilt integrations for the exact set of tools used in this project, with options to trade them out relatively easily should the underlying infrastructure change.

    Since the machine learning and advanced analytics components don’t exist yet, there has to be some way to extract value from all this effort, right? To start, we’ve covered the database extensively, but not the continuous performance metrics, or what kind of load it places on a system while humming in the background. Here are some screenshots of the freely available SQL Server monitoring dashboard in Grafana Open Source. When this was captured, I was in the middle of loading several million log entries, and the upsert commands were causing table locks across the board, leading to degraded performance.

    For the unfamiliar, an upsert procedure is a way to protect the integrity of your data at the cost of some performance. In the process of building this project, I relied on upserts as a guardrail to minimize, if not fully prevent, duplicate entries in the log tables while the model was still under construction and testing. The basic formula of the operation is to accept a set of parameters, check them against various tables or constants, insert the data into the correct place, and return some identifier for the record. If a conflicting or duplicate entry is found, return the existing identifier instead. Here’s the bare-basics example that upserts an IP address into the table defined earlier; it has acted as a great guardrail for me during testing.
     

    -- =============================================
    -- Title:       	IPAddress - Upsert IP Address
    -- Author:		Curtis Dibble
    -- Date:		12/14/2024
    -- Schema:		IPAddress
    -- Type:		Stored Procedure
    -- Description:
    --	Upsert a (Single) VARCHAR(45) IPv4 or 
    --	IPv6 Address to the IP Address table and
    --	store it as a VARBINARY(16) item
    -- =============================================
    
    CREATE PROCEDURE [IPAddress].[UpsertAddress] @Value VARCHAR(45),@ID INT OUTPUT AS
    BEGIN 
        SET NOCOUNT ON;
        DECLARE @IPBinary VARBINARY(16) = [IPAddress].[ToBinary](@Value);
        -- Try to find existing IP
        SELECT @ID = [Id] 
        FROM [IPAddress].[Addresses] 
        WHERE [Value] = @IPBinary;
        -- Insert if not found
        IF @ID IS NULL
        BEGIN
            INSERT INTO [IPAddress].[Addresses] ([Value]) VALUES (@IPBinary);
            SET @ID = SCOPE_IDENTITY();
        END
    END;
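
    Calling it looks like this; the second call returns the same Id instead of creating a duplicate row (the address value is illustrative):

    DECLARE @Id INT;
    -- First call inserts the address and returns the new Id
    EXEC [IPAddress].[UpsertAddress] @Value = '198.51.100.23', @ID = @Id OUTPUT;
    -- Second call finds the existing row and hands back the same Id
    EXEC [IPAddress].[UpsertAddress] @Value = '198.51.100.23', @ID = @Id OUTPUT;
    SELECT @Id AS [AddressId];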
    

     

    That’s most of the relevant code. The rest can be found on GitHub via the link in the references [21]. The overall idea here was to leave the boring and repetitive stuff up to the computer and get back to finding bad actors on the open web. So before embarking on building your own version or trying to reinvent the wheel, here are some previews of what exists, and a few ideas of what’s possible with a local install of Grafana and what’s been presented so far.

    The first item on any professional’s list is keeping an eye on their tools and knowing what’s happening in real time. This is the freely available SQL Server monitoring dashboard for Grafana. It significantly outperforms the built-in SQL Server Management Studio monitoring tools from what I’ve seen.


    Figure 12: Main dashboard for SQL server monitoring. Few things in life are better than the ability to get the information you need in a single place, without having to shift gears to find it [15].

     

    Now that the database is monitored, the question becomes: how does this scale to handle more than one honeypot, or even a fleet of them, spread across multiple sites or cloud providers? Fortunately, when it comes to the major cloud providers, other professionals have put in the time and effort to release high quality observability tools. This is the freely available Azure Virtual Machine and Infrastructure monitoring dashboard. Just like the SQL Server dashboard, adding it to a Grafana instance is essentially copying and pasting a single string into the web interface.

     


    Figure 13: For my instance, the honeypot was set up on Azure, and there’s a conveniently available infrastructure and virtual machine monitoring dashboard [16]. A word of caution: Azure permissions are not the easiest thing to configure when self-hosting a Grafana instance. This includes a preview of the stats provided. A similar tool exists for most major operators.

     

     


    Figure 14: Network statistics provided by Azure. 

     


    Figure 15: Disk I/O statistics provided by Azure.

     


    Figure 16: CPU Usage statistics provided by Azure.

     

    All of that shows our tools, but to see what adversaries are up to, we actually have to use the database and load it with some historical data. It will trickle in eventually as it’s loaded, but seeing what a well populated instance looks like is always a useful way to benchmark your own.

     


    Figure 17: One of the most interesting aspects of building this from scratch was being able to experiment with other people’s implementations of classic visualizations. The most effective, and best-looking at scale, has to be the heatmap. This is a breakdown of Cowrie Input/Success/Failed Command Logs across Azure, AWS, and GCP from 10/31/24 to 11/7/24. Notice the clean break not long after Halloween?

     


    Figure 18: Each column can be filtered, which is done Grafana-side using the cache and limiting query volume. Not important for the first several hundred thousand records. Incredibly important after several million.

    Cui Bono?

    To anyone that has read this far, thank you. Because now we have to figure out who benefits from this endeavor. Given all of the above, many of the more advanced features still need to be built, and this is more of a foundation for a change than a revolution on its own. I’m short on time, and am releasing this to the community as a starting point for anyone to follow whatever hunch they’ve been meaning to run down. Maybe someone can solve some hard problems that are completely unrelated. I have no idea what will come of it, but if nothing else, my goal is that someone can use this as a jumping-off point to learn a new thing.

    Final Thoughts

    Before anyone jumps to conclusions about me having contradicted myself earlier, or about the perils of vendor lock-in leaving Microsoft as the big winner here: what I have written so far is explicitly designed to stay under the 10GB SQL Server Express limit for as long as possible. It’s as close to the ANSI standard [17] as I was able to manage with relatively limited time, and that’s for a fairly decent reason. Cybersecurity is hard. Picking patterns out of chaos is mentally taxing and emotionally exhausting. The ANSI standard is important because it means that this will translate to other relational database systems, such as vanilla PostgreSQL, or allow use of some of the more interesting extensions like TimescaleDB [18] and PG Vector [19], to enable machine learning operations without building interfaces by hand, or suffering the overhead of exporting data and re-vectorizing before building training loops in PyTorch/TensorFlow.

    That all deals with scaling up to industrial operations, but what if we went the other direction and scaled down? The team at Turso undertook a rewrite of SQLite3 in Rust to bring in some more modern features, such as async/await patterns and a more open test suite, in the libSQL project [20]. I see no reason why there couldn’t be a default or pre-baked SQLite3/libSQL database as a single file full of known URLs, endpoints, and other gems, if for no other reason than to simplify the nightmare of building a unique extract-transform-load pipeline for every new document type containing indicators of compromise, especially for ones that don’t happen to conform to STIX/TAXII/YARA schemas.

    To close this out, I leave translations, optimizations, and general experimentation up to the end user. I hope it will be beneficial or useful to someone out there. Thanks for following along.


     

    References

    1. https://www.sans.edu/cyber-security-programs/bachelors-degree/
    2. Dell - Alienware m16 R2 Laptop - https://www.dell.com/en-us/shop/dell-laptops/alienware-m16-r2-gaming-laptop/spd/alienware-m16-r2-laptop 
    3. In Defense of Not-Invented-Here Syndrome https://www.joelonsoftware.com/2001/10/14/in-defense-of-not-invented-here-syndrome/ 
    4. Primeagen - Not Invented Here Syndrome – https://youtu.be/uAFy0x3LRBU?si=1tDvaFKGGTNjWg8o 
    5. SANS ISC - Using Zeek, Snort, and Grafana to Detect Crypto Mining Malware https://isc.sans.edu/diary/Guest+Diary+Using+Zeek+Snort+and+Grafana+to+Detect+Crypto+Mining+Malware/31472/ 
    6. GitHub - DShield-SIEM - https://github.com/bruneaug/DShield-SIEM/tree/main 
    7. SANS ISC - DShield Sensor Monitoring with a Docker ELK Stack - https://isc.sans.edu/diary/DShield+Sensor+Monitoring+with+a+Docker+ELK+Stack+Guest+Diary/30118 
    8. Elastic - ELK Stack - https://www.elastic.co/elastic-stack 
    9. IBM - QRadar - https://www.ibm.com/qradar 
    10. O’Reilly - Joe Celko’s SQL for Smarties - https://www.oreilly.com/library/view/joe-celkos-sql/9780128007617/ 
    11. O’Reilly - Joe Celko’s Trees and Hierarchies in SQL for Smarties - https://www.oreilly.com/library/view/joe-celkos-trees/9780123877338/ 
    12. Microsoft - SQL Server Technical Documentation - https://learn.microsoft.com/en-us/sql/sql-server/?view=sql-server-ver16 
    13. Intel – OpenVINO Toolkit - https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/overview.html 
    14. Grafana - OSS - https://grafana.com/oss/grafana/ 
    15. Grafana – Microsoft SQL Server Dashboard - https://grafana.com/grafana/dashboards/21378-microsoft-sql-server-dashboard/ 
    16. Grafana - Azure Virtual Machine Dashboard - https://grafana.com/grafana/dashboards/10532-azure-virtual-machine/ 
    17. ANSI – SQL Standard ISO/IEC 9075:2023 - https://blog.ansi.org/sql-standard-iso-iec-9075-2023-ansi-x3-135/ 
    18. GitHub - Timescale DB PostgreSQL Extension - https://github.com/timescale/timescaledb 
    19. GitHub - PG Vector PostgreSQL Extension - https://github.com/pgvector/pgvector 
    20. GitHub - libSQL SQLite Rewrite - https://github.com/tursodatabase/libsql 
    21. Github - The Code - https://github.com/maxwell2011/DShield-Database/tree/master 

     

    --
    Jesse La Grew
    Handler
